
Multi-agent Inverse Reinforcement Learning for Zero-sum Games



Abstract

In this paper we introduce a Bayesian framework for solving a class of problems termed Multi-agent Inverse Reinforcement Learning (MIRL). Compared to the well-known Inverse Reinforcement Learning (IRL) problem, MIRL is formalized in the context of a stochastic game rather than a Markov decision process (MDP). Games bring two primary challenges: First, the concept of optimality, central to MDPs, loses its meaning and must be replaced with a more general solution concept, such as the Nash equilibrium. Second, the non-uniqueness of equilibria means that in MIRL, in addition to multiple reasonable solutions for a given inversion model, there may be multiple inversion models that are all equally sensible approaches to solving the problem. We establish a theoretical foundation for competitive two-agent MIRL problems and propose a Bayesian optimization algorithm to solve the problem. We focus on the case of two-person zero-sum stochastic games, developing a generative model for the likelihood of unknown rewards of agents given observed game play, assuming that the two agents follow a minimax bipolicy. As a numerical illustration, we apply our method in the context of an abstract soccer game. For the soccer game, we investigate relationships between the extent of prior information and the quality of learned rewards. Results suggest that covariance structure is more important than mean value in reward priors.
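The minimax bipolicy assumed by the abstract's generative model can be illustrated on the simplest case, a single-state (matrix) zero-sum game. The sketch below is not the paper's algorithm; it is a hypothetical illustration of the solution concept, using the closed-form fully mixed equilibrium of a generic 2x2 zero-sum game.

```python
# Hypothetical illustration (not the paper's method): closed-form minimax
# mixed strategies for a 2x2 zero-sum matrix game with row payoffs
# [[a, b], [c, d]], assuming no saddle point in pure strategies.
def minimax_2x2(a, b, c, d):
    """Return (p_row, q_col, value): probability each player assigns to
    their first action, and the game value for the row (maximizing) player."""
    denom = a - b - c + d
    p = (d - c) / denom          # row player's weight on action 0
    q = (d - b) / denom          # column player's weight on action 0
    value = (a * d - b * c) / denom
    return p, q, value

# Matching pennies: both players mix uniformly and the value is 0.
p, q, v = minimax_2x2(1.0, -1.0, -1.0, 1.0)
print(p, q, v)  # 0.5 0.5 0.0
```

In a stochastic game, a minimax bipolicy plays such an equilibrium of the induced matrix game at every state, which is the behavioral assumption under which the abstract's likelihood of observed play is defined.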

